Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add vec_proxy_missing() #1928

Closed
wants to merge 2 commits into from
Closed

Conversation

khusmann
Copy link

@khusmann khusmann commented Apr 15, 2024

vec_proxy_missing() adds a new proxy that allows developers to extend the is.na(), na.omit() etc. behavior of their vctrs (as I describe in (#1925). It's useful/needed when vctrs contain special values (or data structures) that represent different types of missingness in their data. In these cases, you need a separate proxy for equal and missing. It also lets you redefine the meaning of complete.

I wanted to take an experimental shot at the implementation here because it didn't seem to hard, and I wanted to play around with the resulting API. I'm extremely satisfied with the result -- this is exactly what I need for implementing a generic vctrs-backed Result<> type in the library I'm working on, and I think could have similar applications elsewhere (like haven::labelled_spss, or declared)

I've added a few tests and documentation updates, but it's only a first-pass. I would be happy to do a deeper & more thorough dive if this is considered for inclusion into vctrs.

Thank you for your consideration!

@khusmann
Copy link
Author

khusmann commented Aug 15, 2024

@lionel- @DavisVaughan

Ok, I know I passionately advocated for this yesterday at conf, but after starting thinking about this again after a long break, I'm embarrassingly reconsidering the need for this (for my use case, at least).

I realized that for my application, it may actually be more theoretically consistent for comparisons with missing values (including missing reasons) to result in missing values. In base R, we have:

1 == NA => NA
NA == NA => NA

similarly:

1 + NA => NA
1 - NA => NA
(etc).

In interlacer, I say "operations on interlaced values are applied to the value channel". This way, when adding two interlaced vectors, only the values are added in the vector and missing values get automatically "masked out". This relies on:

1 + na("reason") => NA
1 - na("reason") => NA

Therefore, equality should also only apply to the value channel, just like for missing values without reasons:

1 == na("reason") => NA
na("reason1") == na("reason2") => NA

But what I was asking for in this PR was the ability to compare missing reasons:

na("reason1") == na("reason2") => false
na("reason1") == na("reason1") => true

But this is actually a different type of comparison, and breaks the interlacer rule if we allow it! (that is, breaks the rule that operations apply on values, masking out missingness). Instead, we should think of it as a new kind of comparison that compares values AND missing values (rather than comparing values and masks out missing). So I think what I actually need is to define a new form of the == and != operators that compare both the value and missingness of a value, and always return a boolean (never missing). Something like:

na("reason1") %==% na("reason2") => false
na("reason1") %==% na("reason1") => true
1 %==% na("reason") => false

This operator would also make sense with unlabelled missingness (raw NA values):

1 %==% NA => false
NA %==% NA => true
na("reason1") %==% NA => false

Anyway, tldr I need to think about this a little more… but I think we'll probably be able to close the PR! (if you have any theory input/perspective on this I'd really appreciate it but no pressure; I know y'all are super busy)

Sorry for the flip flop on this!

khusmann added a commit to khusmann/interlacer that referenced this pull request Aug 16, 2024
The interlacer rule is "operations are applied to the value channel"
and missing values are automatically masked out. the original
behavior did not follow this in order to allow for comparsion
of missing reasons. Instead, for missing reason comparsions
we will need to define new equality operators that compare
both values and missing reason channels.

See discussion here for more info: r-lib/vctrs#1928
khusmann added a commit to khusmann/interlacer that referenced this pull request Sep 5, 2024
The interlacer rule is "operations are applied to the value channel"
and missing values are automatically masked out. the original
behavior did not follow this in order to allow for comparsion
of missing reasons. Instead, for missing reason comparsions
we will need to define new equality operators that compare
both values and missing reason channels.

See discussion here for more info: r-lib/vctrs#1928
khusmann added a commit to khusmann/interlacer that referenced this pull request Sep 5, 2024
The interlacer rule is "operations are applied to the value channel"
and missing values are automatically masked out. the original
behavior did not follow this in order to allow for comparsion
of missing reasons. Instead, for missing reason comparsions
we will need to define new equality operators that compare
both values and missing reason channels.

See discussion here for more info: r-lib/vctrs#1928
@khusmann
Copy link
Author

khusmann commented Sep 5, 2024

The latest version of interlacer (v0.4.0) now has the behavior I describe above, which so far seems to be working well... I'm closing this issue now!

@DavisVaughan
Copy link
Member

I think it makes sense that all the vctrs ops operate on the underlying value, but then you could have special interlacer_equals() operations or something like that (possibly an infix op like you showed) that "know" about the different kinds of missingness as well

It's kind of like how in ivs you can do vctrs::vec_set_difference() and it uses the vctrs definition of sets, but there is also ivs::iv_set_difference() that uses the interval definition of sets

@khusmann
Copy link
Author

khusmann commented Sep 5, 2024

It's kind of like how in ivs you can do vctrs::vec_set_difference() and it uses the vctrs definition of sets, but there is also ivs::iv_set_difference() that uses the interval definition of sets

Good parallel! (and ivs looks super cool btw)

The other nice part of the infix %==% operator is it is I'm finding that even without interlaced types the definition makes sense and is actually pretty useful. It can be used to determine if the values of two columns are identical in values and missing pattern, even with unlabeled missingness:

library(interlacer)

a <- c(1, 2, NA, 3)
b <- c(1, 2, NA, NA)

a == b
#> [1] TRUE TRUE   NA   NA

a %==% b
#> [1]  TRUE  TRUE  TRUE FALSE

# This becomes handy if I want to see if the vectors are identical
# in values and in missing pattern:
all(a %==% b)
#> [1] FALSE
all(a %==% a)
#> [1] TRUE

# By contrast, if I naively use base `==` I get `NA`
all(a == b)
#> [1] NA
all(a == a)
#> [1] NA

# To get the same behavior as all(a %==% b) requires a bit of gymnastics:
all(ifelse(is.na(a == b), is.na(a) & is.na(b), a == b))
#> [1] FALSE

Created on 2024-09-05 with reprex v2.0.2

As far as I'm aware, there aren't any other convenient ways to make this basic kind of value & missingness pattern comparison in base or rlang or vctrs... Right?

Sure, there's all.equal() and identical(), but 1. they operate on the full vector & don't have vectorized versions and 2. the former is for "near" equality, and the latter additionally compares attributes (rather than just values).

@DavisVaughan
Copy link
Member

vec_equal(na_equal = TRUE) offers some of this behavior

a <- c(1, 2, NA, 3)
b <- c(1, 2, NA, NA)

a == b
#> [1] TRUE TRUE   NA   NA

vctrs::vec_equal(a, b)
#> [1] TRUE TRUE   NA   NA

vctrs::vec_equal(a, b, na_equal = TRUE)
#> [1]  TRUE  TRUE  TRUE FALSE

Created on 2024-09-05 with reprex v2.0.2

@khusmann
Copy link
Author

khusmann commented Sep 5, 2024

Ah nice! So I can simplify / generalize the definition of x %==% y to be:

`%==%` <- function(x, y) {
  args <- vec_cast_common(x, y)
  vec_equal(vec_proxy(args[[1]]), vec_proxy(args[[2]]), na_equal = TRUE)
}

(Note vec_proxy() is used here instead of vec_proxy_equal() because vec_proxy_equal() returns a vctr designed for value comparison with ==, not for deep structural comparison as we want here)

With this definition the operator becomes meaningful for all vctrs record types as a "record equal" (or "deep equal"?) operator, as opposed to the usual "value equal". Cool.

For ivs, for example, this would allow you to compare intervals with one or more bounds missing. Interval vectors essentially can have 3 types of structural missingness: missing start, missing end, and missing both. So you'd have:

iv(5, 5) %==% iv(5, 5) # TRUE
iv(5, NA) %==% iv(5, NA) # TRUE
iv(NA, 5) %==% iv(NA, 5) # TRUE
iv(NA, NA) %==% iv(NA, NA) # TRUE

iv(5, NA) %==% iv(NA, NA) # FALSE
# Etc.

Compared to behavior of regular ==:

iv(5, 5) == iv(5, 5) # TRUE
iv(5, NA) == iv(5, NA) # NA
iv(NA, 5) == iv(NA, 5) # NA
iv(NA, NA) == iv(NA, NA) # NA

iv(5, NA) == iv(NA, NA) # NA
# Etc.

This makes sense because when we test iv(5, NA) == iv(5, NA), we're asking "were these the same measured interval?" and the correct answer is "we don't know, because we forgot what the upper bound was... so we can't say either way". Whereas when we ask iv(5, NA) %==% iv(5, NA), we're asking "are these record entries equivalent?" and the answer is "yes, because they both have the same lower bound and are both missing the upper bound".

Of course right now in ivs when either the start or end value of the input is NA you force both to be NA (to avoid needing to deal with the complexity of incomplete intervals, which is totally valid). Still, it's interesting to think about applications of this operator outside of interlacer... If this has more broad application, it might be worth considering for inclusion into vctrs.

Anyway, thanks for thinking this thru with me! I'll continue to stew, and pls don't hesitate to tag me if you see any more applications or parallels on anything else you're working on...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants